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i-rt ' The nervous system solves a wide variety of problems in signal processing. 

P^ I In many cases the performance of the nervous system is so good that it appo- 

O ■ raches fundamental physical limits, such as the limits imposed by diffraction 

and photon shot noise in vision. In this paper we show how to use the language 

of statistical field theory to address and solve problems in signal processing, 

that is problems in which one must estimate some aspect of the environment 

V^ ' from the data in an array of sensors. In the field theory formulation the op- 

C^ I timal estimator can be written as an expectation value in an ensemble where 

the input data act as external field. Problems at low signal-to-noise ratio can 
be solved in perturbation theory, while high signal-to-noise ratios are treated 
with a saddle-point approximation. These ideas are illustrated in detail by an 
example of visual motion estimation which is chosen to model a problem solved 
by the fly's brain. In this problem the optimal estimator has a rich structure, 
adapting to various parameters of the environment such as the mean-square 
contrast and the correlation time of contrast fluctuations. This structure is in 
qualitative accord with existing measurements on motion sensitive neurons in 
the fly's brain, and we argue that the adaptive properties of the optimal esti- 
mator may help resolve conlficts among different interpretations of these data. 
Finally we propose some crucial direct tests of the adaptive behavior. 



1 Introduction 

Imagine walking along a busy city street. As we walk, we are almost unaware 
of the myriad tasks which our brains are performing: We use a combination of 
visual and vestibular signals to keep ourselves upright, sensors in our feet and 
leg muscles help adjust our stride to the terrain, we listen for cars and people 
behind us, vision helps us recognize the cafe we are approaching and identifies 
our friend who will meet us there, and all these senses combine to provide us 
with a trajectory which reaches our goal and avoids obstacles. All of these 
tasks involve signal processing, and we have the qualitative impression that we 
(and other animals) are quite good at solving these problems. The goal of this 
paper is to show that statistical mechanics provides the natural language for 
formulating and solving signal processing problems, and that the structure of 
the statistical mechanics models provides a predictive theory of how real brains 
solve the corresponding problems. 

1.1 Physical limits and biological signal processing 

Signal processing is in essence a physics problem. As an example, in vision the 
precise formulation of any task must begin with the fact that images are blurred 
by diffraction and corrupted by photon shot noise. These irreducible limitations 
in the quality of the input signal limit the reliability with which the brain (or 
any device) can estimate what is really going on in the visual environment — 
we can never be truly certain of what we are looking at, nor can we know its 
exact trajectory, and so on. There are physical limitations from the hardware 
as well — cells of a certain size are bound to generate electrical noise, signals are 
passed from one point to another along cables of limited information capacity, 
... . Remarkably, these physical limitations to the reliability of computation 
are actually relevant to the operation of real brains. Indeed there are many 
cases where the performance of the nervous system is essentially equal to the 
limit that one calculates from first principles M, ; examples range from photon 
counting in toads and humans O 0| to visual motion estimation in flies ItI E3to 
acoustic coding in frogs and crickets [^ and echo delay estimation in bats |}48| . 
These observations strongly suggest that a theory of optimal signal processing 
will help us understand the computational strategies of brains. 

1.2 Choosing a problem 

Of the many examples of signal processing in the nervous system, our discussion 
is motivated most directly by the problem of motion estimation in the fly's visual 
system. In many insects the visual system produces movement signals which are 
used to control the flight muscles and thereby stabilize flight. In the case of flies, 
visually guided flight behavior has been studied both by measuring flight paths 
during natural behaviors [pO| pi , 52 , dS and by examining the torques produced 



by flies hanging from a torsion balance in response to movements of controlled 
patterns across the visual field |^, |3^ . The input to the motion computation 
comes from a single class of photoreceptor cells which are arrayed in the regular 
lattice of the compound eye, and the signal and noise properties of these cells 
are extremely well characterized. The output of the movement computation 
can be monitored in a handful of identified cells of the lobula complex |^, |l^ , 
and destruction of individual lobula plate neurons produces clear deficits in the 
fly's opto- motor behavior |l^. The fact that these neurons are "identified" 
has a technical meaning: cells of essentially identical morphology occur in each 
fly, and these cells have responses to visual stimuli which are quantitatively 
reproducible from individual to individual. Thus the cells can be named and 
numbered based on either structure or function; under favorable conditions one 
can record from the cell HI, which codes wide field, horizontal movements of the 
visual field, for periods of up to five days Q . The accessibility of quantitative 
measurements at each of several layers in the nervous system — photoreceptors, 
second order neurons, motion-sensitive cells, flight behavior — makes the fly a 
nearly ideal testing ground for theories of neural signal processing. 

1.3 Relation to previous work 

Signal processing is of course an enormous field with a diverse set of applica- 
tions [^ |3^, Q . At the core of this field is the description of random func- 
tions, whether they be functions of time (e.g., sounds) or space (e.g., images). 
Nonetheless, the bulk of the literature on signal processing does not seem to 
make extensive use of the functional integral methods which seem so natural 
from the perspective of statistical mechanics. In 1988, however, Zweig and Lack- 
ner p8| studied a model signal processing problem, reconstructing the configu- 
ration of a randomly moving plate using a set of sparse and noisy observations 
of displacements at particular points along the plate. By choosing the random 
motions from the Boltzmann distribution they emphasized the connection to 
statistical mechanics. 

Traditional approaches to the problem of "optimal estimation" begin by 
defining a class of estimation strategies and then use variational methods to 
search this class for the best estimator. This approach goes back to Wiener and 
Kolmogorov, who found the optimal linear filters for recovering and extrapo- 
lating signals in noisy backgrounds. In this analysis the causality of filters is a 
crucial constraint, and hence the key mathematical ingredient is the role of the 
causal analytic structure of filters in solving the integral equations which result 
from the variational definition of optimality. The Wiener-Kolmogorov results 
essentially close a broad class of questions related to optimal estimation by linear 
filters, but leave completely open problems where the optimal estimator may be 
a non-linear function of the input data. In fact, the classical approach gives us 
little hint about the conditions for the importance of non-linearity. We shall see 
that the statistical mechanics approach gives us tools for going beyond linear (or 



lowest order non-linear) estimators. Rather than searching through a limited 
class of estimators for a local optimum, the mapping to a statistical mechanics 
problem allows us to directly construct the globally optimal estimator. 

In the case of vision, it is widely recognized that unconstrained variational 
approaches to estimation result in ill-posed problems Q . The "regularization" 
of these ill-posed problems can be given a probabilistic interpretation, in which 
the optimal estimator reflects a compromise between fitting the available data 
and taking account of a priori knowledge about the expected distribution of 
incoming signals; this is the same structure described by Zweig and Lackner 
in their model problem. Different regularization terms thus represent different 
hypotheses about the statistical structure of the visual environment, but there 
has been very little experimental work to characterize these statistics [Q . Fol- 
lowing Geman and Geman [Q, there has been considerable focus on "Markov 
random field" models for image statistics, which are known to be equivalent to 
Boltzmann distributions with local Hamiltonians, but there has been relatively 
little work exploiting the full statistical mechanics structure of even these simple 
models. In particular, if one takes the statistical models seriously as models of 
the problems that are solved by real brains, there is the clear prediction that the 
signal processing strategies of the brain must adapt to the statistical structure of 
the environment. This sort of adaptation is much richer than that convention- 
ally described in the biological literature, and we shall explore these predicted 
effects in some detail since we feel that they constitute the key experimental 
signature of optimal estimation. 

The connection between statistical mechanics and signal processing has been 
used in previous work to study vision at very low photon flux, where the low 
signal-to-noise ratio (SNR) leads to a universal "pre-processing" stage in any 
estimation task [pi R% p3 . The structure of this pre-processing step is in good 
agreement with the responses of cells in the first stage of retinal signal process- 
ing, and as far as we know this is the first example of a successful parameter-free 
prediction of neural responses. Most sensory systems encode incoming signals 
in sequences of discrete pulses, termed spikes or action potentials, and the prob- 
lem of decoding these pulse sequences can again be given a statistical mechanics 
formulation ||9|] ; the resulting predictions concerning the algorithms for optimal 
decoding have been confirmed in experiments on a wide variety of organisms 
[^ . Finally, the low SNR limit of motion estimation in fiy vision has also been 
studied ||, |4[ Q , and this analysis was crucial in establishing that the fiy does 
make optimal estimates, but we shall see that by restricting attention to low 
SNR one misses a large amount of structure which is likely relevant under nat- 
ural conditions. Attempts to go beyond the low SNR perturbation theory have 
been confined to model problems M, nV . 



2 Simple Examples of Signal Processing 

A typical signal processing problem consists of extracting information from the 
output of a device corrupted by noise. The signal might be encoded in the data 
in a complex way or might even be present only statistically (as in the case of 
prediction of time series). The task is then, given the statistical properties of the 
quantities of interest, to find the optimal estimate of the signal using the data. 
To do so, one must decide on a definition of the quality of an estimate. There 
are two obvious choices: maximum likelihood and least-mean-squarc error. The 
maximum likelihood estimate corresponds to the signal that has the highest 
probability given the observation of the data. The least-mean-square estimator 
is the one that would on average minimize x^, the distance square to the true 
value; it is equal to the conditional mean. There are other cost functions whose 
minimizations lead to different estimators. Before we proceed with our main 
problem, let us explain our approach on a few simple models where everything 
can be computed exactly. 



2.1 One variable estimation 

Consider the following problem: estimate a signal consisting of a single number 
s using the output y of a detector contaminated by noise, y — s + r]. Since the 
noise is random and the true signal is unknown to the observer, this problem is 
a probabilistic one. To find the best estimate of the signal, we need to construct 
the probability of s given the output y. We use Bayes' theorem to write down 
the conditional probability of s given y: 



P[s\y] = 



P[y\s]P[s] 

P[y] 



(1) 



P[y|s] describes the detection process: how the output is related to the 
input signal and the probabilistic nature of the noise. P[s] reflects the a-priori 
knowledge of the signal, and P[y] is independent of s and thus serves as a 
normalization. 

Suppose first that both signal and noise are chosen from independent Gaus- 
sian distributions with variance S and N respectively, i.e. 
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After a little bit of algebra, we find 



and P[y\.^] = Pvp -A£ L. . (2) 
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The conditional probability distribution is thus also Gaussian with mean „ ,^ 
and variance Sv^ ■ In this example, the maximum and the average of the 



conditional probability distribution are equal and are given by 

S 



S + N 



y- (4) 



The best estimate of the input is thus a linear function of the detector output. 
This will be true in general whenever the probability distributions are Gaussian. 
It is not necessarily the case when either the signal or the noise is non-Gaussian 
or when the output is a non-linear function of the signal. 

To illustrate the emergence of non-linear estimators, consider a small mod- 
ification of our toy-model. Suppose now that the signal is positive and taken 
from an exponential distribution with mean 1/a and that the noise is Gaussian 
and additive but has a variance proportional to the signal (i.e. N = /3s). The 
analog of Eq. (||) is now 

PW„| = _l,„p(-„.-i^). (5) 

where Z is an s-independent normalization factor. The maximum likelihood 
estimator, which maximizes Eq. (o), is given by 



'" 2(1 + 2al3) 



(6) 



To compute the least mean-square estimator, we need to average s with 
respect to P[s|y]. After some algebra and a trip to our favorite integral table 
we find 

(7) 



„LMs _ /3 , |y| 
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In this example, the two definitions of optimal estimation lead to different 
results; both are non-linear functions of the output variable y. This of course 
comes from the non-Gaussian nature of the signal and the dependence of the 
noise upon the signal. Amusingly, in this case the optimal estimator is not even 
monotonic in the detector output y. 

2.2 Causal linear filtering 

Let us now consider the estimation of continuous signals by doing the time 
dependent version of our Gaussian toy-model. Suppose now that s{t) and ri{t) 
are functions of time chosen from Gaussian distributions with power spectra 
S{u}) and N{uj). In functional integral language M, p5{: 



1 

P[s] = ^- cxp 
Zs 



1 /-^ duj \s{uj)f 



2 J_^ 27r S{lo) 



(8) 



where Zs is an ill-defined normalization constant which as usual never enters in 
the computations. There is a similar equation for P[2/|s] . We find the conditional 
probability distribution 



(9) 
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To find {s(Lu))y we compute the "equation of motion" and find 

which can be written in the time domain: 

se{t) - r dTg{T)y{t - r) where g{r) ^ H |^e— f[^^ (11) 

J-oo J-oo 27r S'(w)+iV(w) 

We see that the problem of estimating continuous functions of time is just the 
problem of estimating the independent Fourier components; for each component 
we have an equation identical to Eq. (0) for the estimation of a single variable. 
It should be clear that this simplicity is tied to the assumption of Gaussian 
distributions for both the signal and the noise. 

The filter g{T) is acausal, since it extends symmetrically both in the past 
and in the future. If we actually want to build a device which takes the y{t) 
as input and returns an estimate of the signal s{t), then this device must be 
causal. It is natural to ask what is the optimal causal estimator, one that would 
required knowledge of the output y{t) only up to the present or, more generally, 
up to a small delay time Tq in the future. The delay time Tq could be negative, 
that would be a predictor instead of an estimator, but the same method would 
apply. This problem was solved by Wiener pm and Kolmogorov pq, 0^. We 
give here another derivation in a slightly different context. Wiener assumed 
linear filtering and found that the optimal causal filter only required knowledge 
of the signal and noise power spectra. We will assume that the signal and noise 
are chosen from Gaussian distribution with specified power spectra and find 
that the optimal estimator is a linear filter. 

The task is to estimate s(— Tq) given the output up to time t = 0. Once again 
we write our estimator as the conditional average of the random variable s(— To) 
upon observing y{t < 0). This is equivalent to averaging first with respect to 
the conditional probability of s given all of y and then integrating out y^(t) 
using the conditional probability P[y^ {t)\y^ (t)] where 

(12) 
We need to write down the probability distribution of y^{t) given y^{t). 

Pb-(')l!/-Wl^;j3H(| (13) 
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P[y(t)] = -exp 
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The key trick pS is to write 
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(14) 
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where \['(a;) is chosen such that it has no poles in the upper half plane. The 
probability distribution in the time domain is then 



P[y{t)] = ^exp 



dT 



duj 

2^ 



Z/(a;)vI/(c.)e— 



(16) 



The function x{t) defined by 



x{t) 



duj 

2^ 



v{u;)-^{Lo)e~'^^ 



(17) 



is causal (i.e. it only depends on the values of y{t) for t < t). 



p[y ^y^^ = z '^^P 



dr 



—y {u)-^{u)e 
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(18) 



(19) 



To do the integral over y'^(t) we change variables to x~^{t) defined as x{t) 
for T > 0. We notice that this is a linear transformation, and since x{t) is causal 
the transformation matrix is lower triangular. Therefore the Jacobian of this 
transformation is independent of y~{t). The Gaussian integral over x(r) gives 
an overall factor that can be absorbed in the normalization of our probability 
distribution. 



P[y ] = ^ exp 



dT 



°° dijj _ 

—y {Lu)^{Lu)e 



(20) 



P[y+(t)|y-(i)] = -exp 



dT 



^(2/-(c^) + y+(c^))*(c^)e- 



(21) 



As mentioned earlier, we first average s using -P[s|j/] and then average the re- 
sult with respect to P[y'^\y^]. The result of the first averaging is the acausal 
estimator (O) evaluated at i = —To, 

/OO T 
^e'^-°y{u;)S{iu)'i'{iu)<i'{iu). (22) 

-OO ^TT 

Since Snc(— To) is linear in y{t), averaging with respect to ( pi|) amounts to re- 
placing y(t) by its average value which solves the equation of motion: 

^ (y-(c^) + y+{uj)) ^{Lo)e-'-^ = for r > 0. (23) 

Imposing ( p3| ) amounts to replacing y{Lu)^{uj) by 

dre-- y ^y(c^')*(^')e-'"'^ (24) 

in equation (P4). The causal estimator is therefore given by 
Sc(-ro)-/ -^e''^^"5(^)§(a;) / dre^M ^y(^')*(^')e"'"'M25) 



27r ./_oo 7-00 27r 



We can easily extend this to the slightly more general case where the output 
if known up to time t and we want to estimate the signal at time t — Tq. After 
rewriting things a little bit, we obtain 

/OC T 
^e-*'^*y(c.)fc(c^). (26) 

kiu;) = ^{ij) / dre'^^ / --e''"'("°-")5(cj')*(tt'') (27) 

Jo J-oo 27r 

We recognize this result as the Wiener filter ||5j, p. 86]. Note that we are 
using a different definition of ^(cj) and the opposite sign convention in Fourier 
transforms. But more importantly, our derivation shows that non-linear terms 
would not help the causal estimation of Gaussian signals with Gaussian noise. 

3 Optimal Rigid Motion Estimation 

We now turn to the problem of visual motion estimation. Although our primary 
motivation is to present a theory of the problem solved by the fly's visual system, 
our formulation is fairly general and should be applicable to any visual system, 
or even to the design of artificial systems. It is important to realize that our 
"model" is a model of the problem the fly is solving, not a model of the neural 
circuitry which implements the solution. Thus we need not concern ourselves 
with the details of fly neuroanatomy; what is important is that we capture the 
essential features of the fly's environment which make the problem of motion 
estimation non-trivial. 



3.1 The model 



The fly wants to estimate its own rotation as it tries to fly in a straight line 
but is pushed around by the wind. If the fly flew in a straight line it would 
see a contrast pattern C[x, t] that changes both as a function of time t and 
position X in the visual field; we define contrast as C = (/ — Io)/Io where / is 
the light intensity at a point and /q its average value. Since we are interested 
only in horizontal motion estimation we give a one-dimensional description, 
where x is just the azimuthal angle or yaw. When the fly turns along some 
angular trajectory 9{t) it sees a modified contrast pattern C[x — 0{t)^i\ then 
each photoreceptor cell produces a voltage given by 

Vn{t) = I T{t) I M{x,, - x)C[x - e{t -T),t-T]+ SV„{t), (28) 

where T{t) is the temporal impulse response of the cell, M{x) the angular 
acceptance profile of the photoreceptor and SVn{t) is the voltage noise. The 
photoreceptors form an array of size N and of angular separation 0o . We will 
assume that the noise is independent in each photoreceptor, Gaussian and ad- 
ditive with power spectrum N{uj). As in the previous section, we use Eq. (|l|) to 
write the conditional probability distribution where now 9{t) is the signal and 
{Vn{t)} are the detector output. Equation ( p8| ) and the voltage noise power 
spectrum determine the probability of observing {V^(i)} given a trajectory 9{t) 
and a contrast pattern C{x,t). Since this contrast pattern is unknown to the 
observer, one needs to average over all possible such contrast. 



p[e{t)m.{m 
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P[e{t)], (29) 



where 



Vn{uj) = T{uj) / dte"^' / dxM{x - Xn)C{x - e{t),t). 



(30) 



Brackets mean the average over P[C{x, t)], the spatio-temporal contrast distri- 
bution. For now we will keep the distribution P[C(x,i)] and P[9{t)] as general 
as possible, except that we will assume P[9{t)\ to be left-right and time reversal 
invariant. We will choose the optimal estimator to be the average 9{t) in the 
distribution (E3) which, as mentioned in the first section, minimizes x^ . We 
can expand the norm square in (Gfl) and drop the |V„p term since it does not 
depend on C(x, t) nor 9{t) and can be absorbed in the overall normalization, 
/ r I — 1 2 

1 I ■„,... -r-^ f du! \Vn{cd)\ 



.(i) 



ZiV) 



9{t) exp 






27r 2iV(o 



N{uj) 



c,e 



where 



Z{V) = exp 



E 



duj \Vn{uo)\ V*{uj)Vn{u}) 

2-K 2N{lj) N{uj) 



(31) 



(32) 
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The first term in the exponential modifies the a-priori independent action for 
C and 9 by introducing a coupHng between the two, while the second couples 
them to an external field Vn(w)- In this model, the optimal estimator is the 
expectation value of an operator within a statistical theory of two coupled fields 
(velocity and contrast) in the presence of an external field (photoreceptor volt- 
ages). This appears to be the general structure of signal processing problems 
from the statistical mechanics point of view — input data act as external fields, 
and optimal estimators are expectation values. Thus the construction of optimal 
estimators is exactly the problem of constructing response functions. 

3.2 Perturbation theory 

As always in field theory it is not an easy task to write down an exact expression 
for Eq. ( pl| ) . But fortunately there exists a potentially small parameter in which 
one can do perturbation theory, namely the signal-to- noise ratio (SNR). If the 
mean receptor voltages Ki(w) are small compared to the typical voltage noise 
(oc ^/WiuJ)) then (following Ref. |40 )we can expand the exponential in ( pl|) 



and write the estimator as a perturbation series in Vn'- 



^7 27r "^ ^ N{u) 

1^ fdu, fdcu' „ (g(^)v;r(^)K;(^0) 

2 2_.y ^y ^K(a^)K.(^ ) ^(^^^(^,) 

1^ rdco m\v*{io)\^) 

2 4-7 27T N{u;) 



(33) 



The first term vanishes because it is linear in contrast and the average contrast 
is zero by definition. The last term is also zero because it is odd under time 
reversal symmetry. To compute the second term it is useful to introduce a 
spatial Fourier representation; we find 

mv:{u;)V,ULu')) = T*(c.)T*(a;')/§|M(fc)|2e''=(— ) 

X f dT f dT'Sc{k,T~T')e''-'^^+'^'^'^ 

x(^(t)e-''=[''(^)-^(^')l). (34) 

where we defined the contrast two-point function Scik,t) as 

(C(fc, t)C{k',t')) = 27rS{k + k')Sc{k, t - t'), (35) 

assuming spatial and temporal translation invariance. Actually, spatial trans- 
lation invariance might not always hold. In the case of small angular jitter on 
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top of forward motion, there is a special point in the visual field, namely the 
direction of the forward motion. Relaxing this assumption amounts to having 
Sc depend on both k and k' which would result in having two k integrals in 
(p4|). This would not change significantly the conclusions of this section, but 
might obscure even more the formulae. We can't say much about the function 
F{k, t, T, t') defined by 

F{k, t, T, r') = (^(i)e-*'=[''(^)-^(^')l), (36) 

without being more specific about P[0{t)]. We will consider concrete examples 
in the next section. Note however that the form of F depends only on the 
statistics of the trajectories and that under our assumption of parity invariance, 
it must be antisymmetric in k. The optimal velocity estimator, to lowest order 
in SNR is therefore given by 

^'^W = E / ^ / ^9nrn{^,^')Vr.{^)Vm{i0'), (37) 



nni 



-<--') ^ 9^/|i-'«i^'"'---' 



X dr dT'e''-'^''+'^'^"^Scik,T~T')F{k,t,T,T'). (38) 



Poggio and Reichardt pointed out |^, ^^ that a velocity estimator, as any other 
functional, could be written as Volterra series in the photoreceptor output. They 
also argued that the simplest term that would give the proper directional sen- 
sitivity is an antisymmetric correlator. Here we have found that the dominant 
term of the optimal velocity estimator at low SNR is an antisymmetric correlator 
whose temporal and spatial characteristics are tuned to the statistics of trajec- 
tories, images (spatio-temporal contrast patterns) and photoreceptor noise. The 
only statistical feature of real moving images needed to compute this estimator 
is their spatio-temporal correlation function. The highly non-Gaussian nature 
of real images does not play a role in the estimation problem at low SNR. 

3.3 Saddle point evaluation 

In order to proceed, we need to be more specific about the distribution of trajec- 
tories. A natural choice is that 6{t) undergoes Brownian motion characterized 
by a diffusion constant D, 



P[0(<)] = |cxp 



-^l'^m')_ 



(39) 



If we assume that the temporal pre-filter T{uj) is invertible, we can use the 
variables C/„(i) defined by 

Unit)= lT-\T)Vn{t-T). (40) 
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Finally, we will neglect the effect of aliasing and replace the discrete photore- 
ceptor lattice by a continuum, Unit) —f U{x,t). It is useful to write things 
using the variables t and k conjugate to x. In those variables the coupling term 
U(k,t) is related to C{k,t) by a time dependent phase. 



C/(fc,t) = M(fc)C(/c,t)e*'=*W. 



(41) 



If the voltage noise is negligible, then the exponential in (p^) becomes a delta 
function and the functional integration over C{k, t) amounts to replacing C(k, t) 
by U{k, i)e~*'^^^*V-W(fc)j in other words we simply forget about the voltage noise 
and write U{k,t) = ij{k,t). To write the form of the estimator we still need a 
more specific distribution of contrast. We will assume that the action for C{x, t) 
is local in time and contains no more than a quadratic term in time derivatives. 



P[Cix,t)] = ^e^p 



dx dt - C^{x, t) + r(C, d^C, die, . . .) 



(42) 



With all these assumptions, we can write down the conditional probability 
distribution for 9{t), 



P[e{t)\U{x,t)] = |exp 



-— [dW^ 
AD J 

a f dk2ikU*Ue + P\U\^0^ 
2 J 2^ pWp 



(43) 



The potential term T{C, . . .) drops out because it is invariant under a time 
dependent coordinate change {x -^ x + 9{t)) that gets rid of the dependence on 
9{t). And the average value of 9{t) is given by the saddle point equation 



sW 



Jdxd^W{x,t)dtWix,t) 
Jdx[d,Wix,t)]^'- 



D- 



where 



Wix^t)^ /ge' 



' 2a{k) 

WW 



dx'e 



'U{x',t). 



(44) 



(45) 



In the last equation we have allowed a to depend on fc, as it is the case in the 
specific example we will consider in the following section. 

One can easily understand equation (44) intuitively. For a fixed pattern 
F{x) moving at a velocity io(i), the pattern seen by an observer at rest would 
be V{x, t) = F{x — Xo{t)). One could compute back the velocity by the formula 
Xo{t) — dtV /dxV. So to estimate a velocity, one can correlate spatial and 
temporal derivatives and normalize by spatial derivatives. If the pattern is 
itself changing in time, as a non-zero a{k) would produce, then there would be 
spurious motion coming from the variations in the image itself. The optimal 
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strategy in the presence of such noise is, as always, to reduce your estimate by 
something hke SNR (remember (^)). In this case the average ratio of spurious 
motion to real motion is related to the product of the two constants D and a. 
If the voltage noise is non-zero, equation (E3) will have to be modified. 
Since usually noise has a much shorter correlation time than the images, it 
improves the performance to integrate the signal over time to beat down the 
noise. The issue is a little bit more subtle than this since in our model the 
velocity also has zero correlation time but it induces a correlation time for each 
spatial frequency of the images equal to 1/k^D. One might expect that the 
optimal estimator at any SNR would look like a smeared out version of ([44|). 
We will find something that hints in that direction when we consider a slightly 
more specific model in the next section. To find corrections to (Q), one should 
not approximate the exponential in ( |3l|) and do a semi-classical approximation 
directly (which amounts to finding the most probable 9{t), in itself a valid 
estimator). Unfortunately when one wants to go beyond the N{uj) = limit, 
one is faced with integro-differential equations which do not lead to any simple 
closed form expression for 9e{t) as a functional of U{x, t). We spare the reader 
more details on these fruitless efforts. 

3.4 Specific model: Gaussian contrast and white noise 

In this section, we will give a particular form to the noise power spectrum and 
to the contrast probability distribution. This will allow us to compute each 
term in the low SNR expansion and to have a better understanding of what 
happens at higher SNR. For instance, in this model we can show that, in the 
noiseless limit, the perturbation expansion matches up term by term with the 
semi-classical result. 

The effective contrast noise power spectrum will be taken to be white, i.e. 
N{u})/\T{ijj)\'^ equals a constant {R^^). If photon shot-noise is the dominant 
source of noise, then this is the case and R is the effective photon counting 
rate. The white noise assumption allows us to write the action locally in time 



and therefore to get rid of the 6'-dependent phase (41) of the first term in the 
exponential in Eq. (pl|). 



S) = ^(^Wexp 



^ I'dt' f^\M{k)C{k,t'r 



2U{k,t')AI{k)C*{k,t')e~''''^^'">]) . (46) 



c,e 

We need to specify a probability distribution for the contrast C(a;, t). Picking 
a realistic one is by no mean an easy problem. Even the statistics of static 
images are hard to describe in simple terms [^. We saw in section |3.2| that, 
to lowest order in SNR, only the two-point function of C{x, t) enters in the 
calculation. Inspired by that fact, we will use a Gaussian distribution for the 
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probability of contrast. Equation (E6h adds two more terms to the effective 
action for C . The first is uncoupled and quadratic; therefore it only modifies 
the propagator without introducing any non-Gaussian interactions. The second 
couples C linearly to MUe'~'^^^ . Choosing P[C] to be Gaussian will allow us to 
do the C integral exactly. We will take the spatio-temporal power spectrum of 
contrast to be of the form 

which corresponds to a world where contrast fluctuations on an angular scale 
1/fc have a variance ~ S{k) and remain correlated for a time T(k). 
We can now do the functional integral over C{x,t) to flnd 



P[0(i)|L/(x,i)] = |exp 



-^ ldte\t)~H[e[t)-u{x,t)] 



(48) 



with the effective Hamiltonian given by 

H[e{t);U{x,t)] = -^ jdt^ jdt2 j^[Sc{k)U{kM)U{-kM) 

X g-|ti-*2l/T(fc)g-ifc[e(ti)-6/(t2)]\ (49^ 

where 



|2 



ScWHMWPScW l + ?^Mi^**«fi (50, 



|2 



f(.)^r(fc) l+^^'^^^)'^5^(^)-(^U ". (51) 

The second term within the braces in Eq. ( pOJ) and (|5^) is the fc-dependent 
photoreceptor signal-to-noise ratio. f(fc) is the relevant time constant at spatial 
frequency k. It measures how for back in the past and in the future extends 
the effect of the photoreceptor data on the optimal trajectory, or in simpler 
terms, how far in time should one look to be able to compute the optimal 
trajectory. This constant will appear at every term in perturbation theory 
and would appear in the exact result, if only it could be written in closed 
form. . . What is important, though, is that f{k) decreases with SNR. It starts as 
T(fc) when the SNR is very low. At very low SNR, it is reasonable to integrate for 
as long as the images are correlated to average over the noise, even at the expense 
of a loss in high frequency resolution. f(fc) goes to zero when the SNR becomes 



very large. As we saw in Section 3.3, the noiseless estimator is local in time. 
Estimating a velocity in the absence of noise amounts to correlating temporal 
and spatial derivatives, and derivatives are local in time. At intermediate SNR 
the optimum lies somewhere between these two extremes. 

15 



It is worth noting that the SNR dependence of f has a simple interpretation. 
In the case of hnear filtering to recover a signal from a noisy background (Section 
2.2), we saw that the optimal reconstruction filters the detector outputs so as 
to give near unit gain to frequencies where the signal-to-noise ratio is high and 
then rolls off to suppress frequencies where the signal to noise ratio is low [Eq. 
(4)] . Roughly speaking, the optimal reconstruction filter has a cutoff frequency 
at the point where the signal-to-noise ratio crosses unity. In our model where 
the power spectrum of the signal is S — \M{k)\'^S{k,uj) [from Eq. (46)] and 
the noise spectrum is A^ = 4>o/R, it is easy to see that for large SNR the cutoff 
frequency when S/N = 1 is just ujc = t~^ . Thus a filter with time constant 
f provides a cutoff which rejects frequencies with less than unit signal-to-noise 
ratio. As the overall SNR is increased, one can trust the photoreceptor signals 
out to higher frequencies. 

Since we have an explicit expression for the conditional probability distribu- 
tion, one might hope that we could find its maximum which would give us the 
maximum likelihood estimator. What is involved is solving 

This integro-differential equation could in principle be solved, but in practice 
there is no nice closed-form solution for general U{x,t). The limit R ^ oo oi 
"521) reduces to (g|) with a{k) = 2T{k)/Sc{k). 



As we did in Section 3.2, we can calculate the optimal estimator as a per- 
turbation series is SNR by expanding the interaction term, the exponential of 
the Hamiltonian (ES). Here the perturbation is a monomial, bilinear in U , so 
the v}^ term in the series will have exactly 2n powers of U . If we think of the 
exact result as a functional of U{x,t), then this series is also the Volterra series 
in U of that functional. For example, the first term, quadratic in U, can be 
viewed either as an approximation to the optimal estimator at low SNR or as 
the quadratic part (in the Volterra sense) of the exact result. 

To compute the perturbation series, we expand in a Taylor series the inter- 
action e^^ and take expectation values within the unperturbed theory, 

9,{t) = - {e{t)H{t)) + 1 (e{t)H{t)H{t)) - {mH{t)) {H{t)) + ... (53) 

The last term comes from expanding the normalization (e^^\. To compute the 
first term, we use the following identities: 

/g-jfc[e(ti)-e(i2)]\ ^ f,-^{(6(ti)-s(t2)f) ^ g-fc'citi-tsl^ (-54) 

= -2iDk[H{ti -t)- H{t2 - t)]e"'^'-°l*i-*^l. (55) 
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With those tools in hand, wc find the quadratic piece of the optimal estimator. 

O'^^Kt) ^ ^^ J ^tkSc{k) J dhj dt2U{k,h)U{-k,t2)e--^''^^'-~'^^ (56) 

where a{k) = k^D + f~^{k). Equation ( |56| ) describes an antisymmetric correla- 
tor. In Section o, we will discuss in more details the meaning of this result. 

Higher order terms in the perturbation series are easily computable using 
identities like ( pq ) . It is not difficult to write an expression for the general term 
of this series, it is only cumbersome and not very enlightening. The quartic 
term contains all of the qualitative features of the following terms (except of 
course that they will have more factors of U{x, t)). It is given by 

^i^lW = ^/|^(.fcr)^c(fc.)/f^c(fc3) 

/t /'OO /"CXD /"CXD 

dh / dt2 / dt3 / dUU(ki,ti)U{-ki,t2)U{k3,t3)U{-k3,U) 
-oo <J t >/— oo >/— oo 

X exp [-a{ki)t2i - a{k3)t43} (exp {kik^Dtusi] - 1) (57) 

where iij — \ti — tj\ and ^1234 = ^31 — ^32 — ^41+^42- Note that |ti234| is the overlap 
between the intervals [^1,^2] and [^3,^4] and is zero if the two intervals don't 
overlap. The values of {ti} that contribute are such that ii and ^2 are not too far 
from t (because of the exponential weighting), ts not too far from ^4 and [ii, ^2] 
overlapping with [^3,^4]. So all the {ti} must be close to t. The condition for 
overlapping intervals comes from the subtraction of the disconnected diagrams 
and it happens at all orders in the expansion. That means that if f(fc) goes to 
zero, as it does when SNR becomes large, these terms should become localized 
in time. If we let 7? ^ 00 in ( pq ) and (m) we find 

dk ., UU*T 



/dk 
2n \M\^Sc 



2 I dk UU*T f dk' ^^ |C/|^ 



This is the Volterra series of the noiseless estimator that we computed earlier 
( [l4[) with a{k) ~ T{k)/2Sc{k). This strongly suggests that our model can be 
solved by perturbation theory. 

With the help of ([44| ) we get a better understanding of the higher order 
terms in the expansion. The quartic term can be interpreted as the beginning 
of saturation, i.e. if the contrast is high enough the estimator's response is 
no longer proportional to the square of the contrast. The response to a step 
displacement (delta function of velocity) of the quartic term is opposite to that 
of the quadratic term. 
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3.5 Causal velocity estimation 



The low SNR estimator that we found in the previous section ( pdj) involves both 
the photoreceptor data from the past and from the future. A real fly, or any 
other physical realization of the estimator, might not be able to afford such a 
luxury. We might be tempted to simply truncate the ^2 integral to only allow 
for a small delay T in the estimation. If T is large compared with the time 
constants l/a(fc) then this procedure should be very close to optimal. But in 
general, this might not be optimal. 

To fit in the constraint of causality in our computation, we would like to do 



as in Section 2.2: solve the acausal problem and average over the conditional 
distribution of the unknown data given the observed data. The Wiener fac- 
torization trick only works for Gaussian distributions, or equivalently for linear 
filters, and is not easily generalizable to arbitrary distribution. Fortunately the 
quadratic estimator (bw is actually linear in the future data. Averaging over 
the values of U{x, t') for t' > t + T will then amount to replacing U{x, t') by its 
average value in the conditional probability P\U'^\U~]. This can be obtained by 
Wiener filtering and requires only knowledge of the power spectrum for f/(a;, t). 

{U{k,t)U{k',t')) = 2TTS{k + k') \^S{t - t') + 5,(fc)e-('^''^+""'('=»l*-*'l| (59) 



As described in Section |2.2| we write the reciprocal of this power spectrum as 
|^fc(a;)p where '^k{'-^) doesn't have any singularities in the upper half plane. The 
index k just goes along for the ride. We then write the "equation of motion" 
for the average value of [/+ given U~ (p3|). It is an integral equation whose 
solution is given by 



t/(fc,t) = (&-/3) 



dt'U{k,t')e'^'' 



where 
and 



b{k) ^k'^D + T-\k) 



-bt 



for t > 0, 



P{k)^b{k)Jl 



2R\M{k)\^Sc{k) 



h{k)4)o 



(60) 
(61) 
(62) 



We can now substitute ( |60D into (pq) evaluated at t = — T. Using time transla- 
tion invariance we can then write the causal estimator evaluated at any time. 



9SW 



2DR^ f dk 



27r 



I /■* /■*+^ 

ikSc{k){ dti dt2U{k,ti)U{~k,t2)e-''^'''>^'^''''^ 



^-^^^f4ll fdti rdi2t/(fc,ii)C/(-fc,t2)e'^('=)(*-*-^)-''W(*-'-^)]63) 
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To lowest order SNR, the different time constants a{k),b{k) and /3(fc) are all equal 
and the second term vanishes leaving us with the truncated estimator. At higher 
SNR, it is not clear what this term should become. If T were zero, it would 
be impossible to estimate the velocity since it has delta function correlations. 
When we put T = in the second term, we get identically zero from the anti- 
symmetry in k if we identify a(k) with /3(fc); these are not strictly equal but their 
difference might come from our uncontrolled approximations. The main result 
of this section is that the truncated estimator is the optimal causal estimator 
at low SNR. 

4 Characterization 

Imagine that we are presented with a physical device which we suspect may 
embody the optimal velocity estimator. How can we probe the system to see 
if our suspicions are correct? What are the key experimental signatures of the 
optimal estimator? One idea is that the optimal estimator is different in differ- 
ent regimes of SNR, so if the putative optimal estimator is designed to adapt to 
a wide range of conditions we might hope to trap the system in these different 
regimes and observe different responses to the same input signals. These 'adapt 
and probe' experiments p6[ are the analog of the standard pump-probe experi- 
ments in condensed matter physics. Another point is that the optimal estimator 
is, by definition, as reliable as possible given the input signals. We can try to 
quantify this reliability by directly measuring the noise in the estimator and 
then referring this noise back to the input as an effective input velocity noise. 
This is a standard procedure for characterizing noise in electronic devices [pSl 
and is also used to quantify the performance of neurons in real-time estimation 
tasks 0. 

4.1 Response function 

Let the optimal estimator be set up to match a world of given statistical char- 
acteristics (power spectrum, photon flux, ...). Rather than choosing contrasts 
C{x,t) from this distribution, wc want to probe the system with some stereo- 
typed stimulus. For simplicity we choose a static random pattern with spatial 
spectral density Sc{k) and we have this pattern make a small step displacement 
at time t = 0. We are interested in the experimentally observable average re- 
sponse of the causal estimator (pSh). By average, we mean here an average over 
the noise and over realizations of the pattern. The angular trajectory for a step 
at i = is simply 9{t) — 9H{t), H{t) being the Heaviside step function and 9 
a small angle compared to (pa and to the width of the photoreceptors optical 
profile. 

Since the noise has delta function correlation and the estimator doesn't cor- 
relate voltages at the same time, there is no noise contribution to the average 
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response. All we need to do is the contrast average and the i-integrals. 

on/?2 i-jh. _ ft i-t+T 

OTit) = -^ —^kScik) dh dh\M{k)\' {C{kM)C{kM)) 




X exp [ik9{H{h) - H{t2))] e-''('=)(*^-*i) (64) 

We are interested in the small 9 response, so we will expand the exponential of 
ik9 up to the first non-zero term. Substituting the average over C^ by its power 
spectrum and performing the i-integrals we find 

_ 2NDR^9 l'dk k^\M{k)\^Sc{k)Sc{k) 

■^ 27r a2(/c) ] (-^ _p~a{k)T\p-a{k)t^ 

(65) 

The three cases being respectively t < —T, —T < < < and i > 0. Unfortu- 
nately the last fc-integral can't be done analytically, but still the last equation 
gives us a good idea of the response. If T is short then the response rises very 
rapidly just before the step. Note that in a real implementation the rise would 
occur between the step and the delay time T. After the step, the response de- 
cays with a time constant related to f(fc). The response to the acausal quadratic 
estimator ( pq ) is also given by (pq ) with T — > oo. In other words, it is symmetric 
about zero and each k component rises and decays with time constant l/a{k). 
At high SNR, the integrated response is independent of the contrast power 
spectrum Sc{k). One can see this by looking at the time integral of (psh and 
noticing that it contains three powers of T(fc) and one of Sc{k) whose renor- 
malization factors (pQ) cancel the two powers of Sc{k). 

4.2 Equivalent input noise 

To quantify the performance of an estimator, we need to define a notion of noise 
in the estimate. We would like to say that the estimate is equivalent to the true 
signal to which noise has been added. This added noise corresponds to the 
combination of the external noise and of the noise introduce by the estimation 
process. Since least-mean-square estimators underestimate the signal, we want 
to correct for systematics. For example in the toy-model considered in Section 0, 
Eq. (0), the external noise has variance N but the chi-square of the estimate is 
N/{l + N/S). We have to first find the gain between the estimate and the signal 
and then compute the power spectrum of the difference between the signal and 
the estimate divided by the gain. In the engineering literature this quantity is 
referred to as the equivalent input noise power. For time dependent signals, the 
gain is a linear filter and is easily dealt with in frequency space. In the case at 
hand velocity is the signal. We write 

9,{Lj)=g{u;)(9{io)+r^{uj)), (66) 
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where g{uj) minimizes ( 9e{^) — 9{^)0{^) 



(67) 



We are interested in the power spectrum of rj which we find using the definition 
(pTI) and some algebra, 



(hHH = (l^HI 




I^MK l^eMI 



e{Lo)et{Lo) 



(68) 



All these expectation values are proportional to 2'K5{io = 0) which should be 
factored out when we compute the power spectrum. For the toy-model (|^) the 
variance of the equivalent input noise is equal to the external noise variance N. 



4.2.1 High SNR estimator 

The first estimator that we will characterize is the instantaneous estimator 
( p4 45) that was obtained by neglecting voltage noise. Although voltage noise 
is assumed to vanish, the remaining signals are not perfect indicators of motion, 
since the contrast C{x,t) can fluctuate even if there is no motion; these time- 
dependent contrasts set a limit to reliability of motion estimation even in the 
absence of true detector noise. As in Section 3.4 we will assume Gaussian con- 
trast distribution with power spectrum given by (|47|), with the only difference 
that we will specify the function T(fc) to be a constant t. The estimator is then 
given by 



.W = 



Dt ^!^ikW{k,t)W{-k.t) 

l + Z3T/gfc2|M^(fc,t)|2 



where W{k,t) 



U{k,t) 



^Scik)\M{kW 



(69) 



To compute the equivalent input noise using ( pq ) we need to take expectation 
values of products of Gaussian fields some of which appear in the denominator. 
It is unfortunately impossible to do so exactly but we can expand those opera- 
tors in their fluctuations around their average value, the former being typically 
smaller than the latter by a factor of iV, the number of photoreceptors. The 
factors of N will enter the calculation when we take the expectation value of 
the contrast evaluated at k and — fc. Formally this would be proportional to 
2'KS(k — 0) because we have used an infinite size approximation. For finite 
sample size, it should be proportional to the size of the system {N(j)o)- Another 
problem is the divergent /c-integrals for which we impose a hard cut-off at the 
sampling frequency (fc = ±7r/(/)o). After some straight-forward computations 
we find the input noise to first order in contrast fluctuations 



iVi 
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7r27V(4 -H r2w2) 



(70) 
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The two terms in the power spectrum of the noise are, as stated above, the 
respective contribution of the overall fluctuations in mean-square contrast and 
of the time-dependent changes in the image. If the correlation time of the images 
is not too short {(pl/Dr small compare to N) then the first term goes like 1/N^ 
and should be neglected to be consistent with our approximations. On the other 
hand if we only consider the numerator of (|69|), corresponding to a correlator 
without saturation, we can redo the calculations leading to ( |70|) and we find 
that the first term simply becomes 18D/5N. This term would be present even 
in the absence of image decorrelation (r — > oo). In other words, an unsaturating 
correlator is always subjected to noise coming from overall contrast fluctuations. 

4.2.2 Low SNR estimator 

The next estimator we want to characterize is the one given by the first term 
in the low SNR expansion (p7|). To be more specific we will use equation ( p6| ) 
obtained by considering white noise and Gaussian contrast. To lowest order 
in SNR, the renormalized time constant f(fc) and contrast spectrum Sc{k) are 
equal to their non-renormalized counterpart except for an obvious factor of 
|M(fc)p. Since in this section we will limit ourselves to the dominant noise 
contribution at low SNR, we will use the non-renormalized functions. We use 
equation ( |68|) to compute the equivalent input noise power spectrum, keeping 
only the dominant terms at low SNR. In particular we only include the noise- 
noise contribution of the estimator's auto-correlation. Schematically if 



JgVV (71) 

then 



NN 



- [9192 {SV16V1SV2SV2) (72) 



where SV is the voltage noise. We also drop the —1 term within the braces of 
( pq ) since it is of higher order in SNR. In this limit the calculation is rather 
simple because the voltage noise correlations are delta functions of time and the 
kernel g is given by the correlation of 9 and VV so both (|^eP) and {991) are 
proportional to the integral of g^ . If we compute these correlations precisely we 
get 

{Oe{t)eM)-^^J^ ^^ e , (73) 

where b{k) = k^D + T^-^{k) and with a similar expression for {99%). So the 
equivalent input noise power is given by 

0o Udk e\M{k)\'^si{k) 1 \-' 

iVi„p,t(a;)-^^^|y- ^^ 462(fc)+c.2/ " (^4) 
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Notice that the noise goes hke 1/A^ and is proportional to the square of the 
inverse effective photon counting rate. At low frequencies it is almost constant 
and at high frequencies it goes like uP' . Thus although we constructed the 
optimal estimator for the velocity ^, at high frequencies the effective noise level 
for estimating Q itself in frequency- independent. This feature of the optimal 
estimator is observed in experiments on flies ||7[. 

4.2.3 Renormalized correlator 



Finally we have shown in Section 3.4 that the renormalized correlator ( pq ) is 
the quadratic part of the Volterra series of the exact optimal estimator. One 
might want to take this estimator more seriously than just a low SNR estimator 
and see what its performance is at any SNR. To do so, one just has to compute 
all the correlation functions in the noise formula ( p8| ) . This is a rather straight- 
forward computation. We actually did it at any frequency (w), but the results 
were not easily presentable so we decided to only include them for a; = 0, and 
even then, we find it more proper to put them in the Appendix. There are a 
few points that are worth making. First, the low SNR result of the previous 
section can be recovered by considering only A4 and Aq in ( |8l| ) and letting 
Sc{k) -^ \M{k)\'^Sc{k) and f(/c) -^ T{k). Secondly, there is a term, namely 
^1, which doesn't decrease with the sample size N . As the SNR gets large and 
the correlator becomes localized in time, this term shrinks. It disappears in the 



SNR goes to infinity limit and we recover the result mentioned in Section 4.2.1 
for a noiseless optimal correlator. 

The fact that Eq. (|8^) has a term which is independent of TV is a bit strange. 
Intuitively we are looking for a motion signal which is coherent across all A'^ 
receptors, while noise sources are local and incoherent across the array, so that 
the optimal estimator must have a noise level which decreases as 1/A^. In fact one 
can prove this quite trivially by assuming that the estimated trajectory 9{t) is 
close to the true trajectory and expanding the effective Hamiltonian [Eq. (48)]; 
the "stiffness" of the system is extensive in N and hence the effective noise level 
is oc 1/A^. What then is wrong with the renormalized correlator? We saw that 
the correlator is just the first term in an expansion of the optimal estimator, 
and that this term is guaranteed to dominate only at low SNR. Indeed, at 
low SNR the correlator gives an effective noise level with the appropriate A'' 
dependence [Eq. (73)], but evidently the higher order terms play a crucial role 
in maintaining the correct A^ dependence at intermediate SNR. We now show 
that this is a symptom of a more general result: any correlator that integrates 
photoreceptor voltages for some finite time will generate a systematic error in 
its velocity estimate which cannot be undone by linear filtering, and hence an 
A^— independent term in the effective noise level. 

A general correlator has the form 

Oe{t)^ [ g{k,h-t,t2-t)V{k,h)V{-k,t2). (75) 
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The term in the noise (By) with the highest power of N arises when one consid- 
ers Wick contractions of the contrast field within the same 9e , which amounts 
to computing the noise in the contrast-averaged estimator. Note that any rea- 
sonable correlator has no average voltage noise contributions. 

£{6; t) = (Oeit))^ = N4>o j 9{k, h ~tM- t)Sc{k, t2i)e'^^'^''^-'^'-)^ (76) 

The point is that the function E is not linearly related to the velocity 0, which 
means that there is a systematic discrepancy between the estimator and the true 
velocity. To finish the proof, we make use of the following identity for functions 
of Gaussian variables: 

f XP XfT 

{F{X)') = {F{X))^ + jdtdt'{X{t)X{t')){j^^j^, (77) 

where X{t) is a Gaussian variable with zero mean. From this we can derive the 
following identity for the function E{9;t): 

[dt {E{0)E{t)) = 2D fdtdt' /^^5!0 \ (7g) 

i J \ S9{0) S9{0) I ^ ' 

In the last expression, we have used the explicit two-point function of 9 
and the fact that E averaged over all trajectories should be zero by symmetry. 
We have also used time translation invariance to shift the i-integrals. The A^- 
independent term in the noise at zero frequency is then 

^mnnt(^ = 0) = 2i:>^^ J-^ where G = fdt^^^. (79) 

mputv ,^,2 J ^g,Q. \ J 



^— wnere u = / ai — -. . 

{Gf J 69{0) 

We find the general result that the iV-independent noise piece is given by the 
normalized variance of the function G, which is the non-linear gain of the es- 
timator as a function of velocity. In the case of a general correlator (|75|), G is 
given by 

^dhdt2tkg{k,h,t2)Sc{kMi) die''^-[''(-*)-'^(*--*)l. (80) 

In general G will depend functionally on 9{t) and therefore will have a non-zero 
variance. The only case where G can be independent of 9 is if g{k,ti,t2) is 
zero for ii 7^ ^2, i-e. the correlator is instantaneous. This complete our proof 
that any correlator with non-zero support has an A^-independent noise piece. 
Although we don't have a formal proof of this, it appears that this systematic 
error cannot be undone by a time independent non-linearity at the output. 
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5 Comparison with experiment 

Let us suppose that the fly's visual system in fact embodies the optimal estima- 
tor whose structure has been discussed in the previous sections. How would we 
compare theory and experiment? We recall that the fly's estimate of angular 
velocity can be probed both in the behavior of the whole fly and in the response 
of individual motion-sensitive neurons. In each case one measures something 
which has different units than the angular velocity itself, so one must be careful 
in interpreting absolute quantities. The dependence of the response on various 
parameters of the stimulus, however, should be a robust prediction of the the- 
ory. Here we draw attention to some of these robust features, and comment on 
the comparison to experiments in the literature. 

• At both high and low signal-to-noise ratios, an essential element of motion 
estimation is the cross-correlation of photoreceptor outputs. At high SNR 
the correlation is between the spatial and temporal derivatives, suitably 
normalized, while at low SNR results the correlation is between spatially 
and temporally smeared versions of these derivatives. 

• The particular spatial and temporal filters which must be applied to the 
photoreceptor signals before computing the cross-correlation depend on 
the SNR, on the statistics of the trajectory 9{t) and on the statistics of 
the visual world defined by P[C{x, t)]. Thus the optimal motion estimator 
is an adaptive correlator. 

• As the SNR becomes large, the optimal estimator crosses over smoothly 
to being a comparison between spatial and temporal derivatives, as in 
Eq. ^ . This smooth crossover is related to the fact that our statistical 
mechanics problem does not have a phase transition, so that in some sense 
the motion estimation problem is solvable in perturbation theory. 

Remarkably, all of these features of the optimal motion estimator have cor- 
relates in experiments on real flies. 

5.1 Correlation 

The idea that insects estimate motion by a correlation scheme dates back forty 
years, to the classic work of Hassenstein and Reichardt [|^ . There is an enor- 
mous body of evidence that fly optomotor behavior can be described at least 
approximately by a correlation model [|2|, ^ [S^l , and the same can be said for 
the responses of HI and the other movement sensitive neurons [lO[ |5q| . The key 
experimental test is the demonstration that the response of the motion-sensitive 
system depends quadratically on the stimuli to individual photoreceptors, and 
indeed this is observed for low contrast stimuli. 

Poggio and Reichardt |p3, R6| tried to cast the problem of motion estima- 
tion in a more general formalism, which consists essentially of a Volterra series 
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expansion of the functional which relates the receptor signals to the estimator. 
They emphasized that the 2nd order correlation term is the simplest term in 
this series which can give an estimate which is related to the real motion signal. 
What was unclear in this formulation is why the system should use the simplest 
term. Further we know that the system is not exactly described by the corre- 
lation model, for example at high contrasts, so in this formulation it appears 
that flies do something more complex than they "need" to do. The statistical 
mechanics approach shows us that the series expansion in the Poggio-Reichardt 
work is justifiable as an expansion in signal-to-noise ratio, that the first term 
dominates only at low SNR, but there is a well-defined limit at high SNR which 
is also surprisingly similar to the simple correlation scheme. It is known that 
the impulse response of the correlator model mS) as the same qualitative fea- 



tures as the response of the cell HI to a step displacement [^ |49| . Perhaps 
more importantly, the spatial and temporal filters which are arbitrary from the 
earlier point of view are completely determined if the fly is to build the optimal 
estimator. 

5.2 Adaptive filtering 

This brings us to our second point, namely that the spatial and temporal fil- 
ters must adapt to the statistics of the environment. Correlations should be 
computed not just between nearest neighbors but over some range of distances 
which depends on the signal-to-noise ratio. There is direct evidence for distant 
neighbor correlation both in fly behavior |ll| and in the responses of HI [^ [f7| , 
and the relative weights of correlation at different distances changes with back- 
ground light intensity in qualitative accord with theory. By measuring the 
transient responses of HI one finds that the time constants of the filters which 
precede the correlation computation can adapt over a range of more than one 



order of magnitude 1 32 , 4^, ^ , and this adaptation is determined locally on the 
scale of at most a few photoreceptor spacings, as would be required for optimal 
processing in an inhomogeneous environment. To date this adaptation has been 
probed using constant velocity motion of rigid patterns, and other deterministic 
stimuli. It is clear from the theory, however, that the optimal processor should 
adapt its filtering to the correlation time and variance of random input signals. 
This is an important prediction because adaptation in the nervous system is 
usually described as the gradual fading away of responses to constant input, as 
in the familiar example of light adaptation where we are first blinded by a bright 
light and gradually recover sensitivity to small amounts of contrast around the 
large mean level; thus classical neural adaptation involves adaptation to the first 
moment of the stimulus distribution. Here we predict that the system instead 
changes its dynamics in response to a higher-order statistical feature, the cor- 
relation time. The statistical framework we have proposed for signal processing 
makes the clear qualitative prediction that the optimal processor must adapt to 
the ensemble of input signals, not just to the mean level, and this prediction is 
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independent of almost all details. It is thus important to test for this statistical 
adaptation in real neurons. 

5.3 Beyond correlation 

Finally, the saturation behavior of the optimal estimator points to the essence 
of current controversies in the literature on visual motion detection. As far 
as we know, most discussions of motion estimation begin by pointing out that 
there are two qualitatively different approaches (see, for example, Ref. pl|). 
In the first, inspired by experiments on insects, one correlates the responses 
of neighboring receptors, while in the alternative scheme one compares spatial 
and temporal gradients. Certainly these seem like very different algorithms — 
in one case the essential non-linear operation is multiplication, in the other 
case division. Furthermore, taking ratios of gradients provides an invariance to 
changes in overall contrast or spatial pattern, while Reichardt's original corre- 
lation scheme does not measure a true velocity but rather confounds angular 
velocity with the contrast and spatial structure of the image. It is tempting to 
suggest that "simpler" animals (like flies) arc limited in what they can compute, 
and therefore use the correlation scheme despite its difliculties, while higher an- 
imals (like us?) have more computational power and hence derive unambiguous 
estimates of motion. The experimental situation is far from clear. Many of the 
models designed to account for human perceptual performance can actually be 
reduced to variations on the correlation model ||2l| , while at least one group 
[ P4 p4| has drawn attention to the aspects of insect vision which do not con- 
form to the correlation scheme and suggested instead that flies do something 
more closely approximated by gradient comparison model. Finally, under sta- 
tistically stationary conditions one can decode the responses of the fly's motion 
sensitive neuron HI and recover an unambiguous if slightly noisy estimate of the 
unknown, time varying angular velocity signal H , indicating that at least under 
these conditions the fly's brain does not confound motion with other aspects of 
the visual world. Can the theory of optimal estimation help us resolve these 
apparent conflicts? 

Perhaps the most important point is that correlation and gradient compar- 
ison are not qualitatively different approaches to motion estimation, but rather 
different limits of a smoothly varying family of algorithms adapted to different 
signal-to-noise ratios. If flies (or humans, for that matter) perform optimal mo- 
tion estimation in environments which span a wide range of SNR, then they 
will sometimes appear to use the correlation scheme and sometimes appear to 
use gradient comparison. The pure versions of these models are valid only at 
infinitesimal or at infinite SNR respectively, and at least in the simple "model 
worlds" we have considered the crossover between these limits is smooth. Thus 
the optimal motion estimator should show the classic quadratic dependence on 
contrast in a low contrast (and hence low SNR) world, then smoothly saturate 
to a contrast-independent response at high contrasts, provided that the photon 
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flux is large enough to allow the development of high SNRs. 

The origin of the saturation behavior is actually two-fold. If one looks just 
at the 2nd order correlator term in the perturbation expansion, then because 
the time constants and spectral densities are renormalized in relation to the 
SNR, one finds that the average response to slow, constant velocity motion 
in fact saturates as the spectral density of contrast fluctuations in increased. 
Examination of higher terms in the perturbation expansion shows the same 
behavior, but as higher order terms become important the series sums to give 
the gradient comparison. The crucial point is that saturation is not just a static 
cutoff of the quadratic growth at low contrast, but rather a subtle combination of 
mechanisms in which the coefficients of the optimal computation must be reset 
in relation to the signal-to-noise ratio. This means that the entire response vs. 
contrast curve should change depending on the rms contrast to which the system 
is adapted, that the deviation from quadratic response should be associated 
with the onset of this adaptive behavior, and that this crossover point where 
adaptation begins should correspond to an rms contrast which provides an SNR 
near unity. 

5.4 Crossover and noise 

The details of the crossover from the correlator to gradient comparison may 
seem uninteresting, but in fact these details are essential for insuring that the 
reliability of the optimal estimator improves with the number of photoreceptors. 
We have seen that the a broad class of correlator models, even adapting correla- 
tor models, fail to give an effective noise level which declines with iV, and hence 
are qualitatively sub-optimal. We know that under some conditions the noise 
level in the fly's velocity estimate approaches the limit corresponding to the 
optimal estimator |^, |^, ^, ^, ^, including the factor of 1/N, but these tests 
are limited to relatively low SNR and the A^-dependence of the effective noise 
level has not been probed directly. Since the higher-order terms in the series 
( p3| ) play the decisive role in setting the iV-dependence of the noise level one 
might hope to find a clear qualitative signature of these terms in the response 
to properly chosen stimuli; this remains an open problem. 

5.5 Summary 

To summarize, the optimal motion estimator has many qualitative features in 
common with the neural motion estimator found in the fly's brain. Theory 
suggests that even some apparently conflicting observations may be understood 
in terms of the rich adaptive behavior of the optimal estimator. This adaptive 
behavior should be directly observable, and would constitute evidence for adap- 
tation to higher-order statistics rather than just adaptation to the DC level. 
Finally, there is the quantitative prediction that the onset of adaptation to 



28 



the signal-to-noise ratio and the concomitant departure from quadratic contrast 
dependence should begin at SNR = 1. 
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Appendix 

For completeness we give here the exact equivalent input noise at zero frequency 
in the renormalized correlator (pq). 



NinputitO = 0) = Aq- 



1 / .43 Ai 

Ai + — { A2 + — + ^ 
N \ R R^ 



(81) 



_ fdk P\M{k)\^Sc{k)Sc{k) 
^"-J^ [a{k) + b{kW ^ ^ 

A, = AD' J J — — \M{k)\'\Mik')\'Sc{k)Scik)Sc{k')Sc{k')k*k'' 

X |c^(fc)c^(fc') [(c(fc) + C{k')f - 4i?2fc2^/2j3|-l 

X {{c{k) + c{k')f (4c2(fc) + llc(fc)c(fc') + 4c2(fc')) 

- AD'^k'^k''^ (6c2(fc) + llc(fc)c(fc') + 6c2(fc') + 8D^k^k'^)}i8S) 

M = ^J^k^\M{krsl{k)SUk)T-\k) 

X {(a + bf{a + h + k^Df{a + 2k^DfY^ 
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+ Ahah^k^D + 8h'k^D + AQa^ik^Df + 72ab{k^Df 
+ 27b^{k^Df + 2Aa{k^Df + 20b{k^Df + A{k^Df] (84) 

A, = -j-wm?sc{k)sUk)e ,3(,^,)4(,2 + ,2) (85) 

where a(k) = k^D + f-^{k), b{k) ^ k^D + T-^{k) and c(fc) = a{k) + b{k). 
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